July 6, 2019

Preview

  1. Introduction to single-cell RNA-seq
  2. Quality control and normalization
  3. Survey of downstream analysis methodology

Highly variable genes

  • Goal: identify a set of genes with high variability attributable to biology (over and above technical variability)
  • Useful for vizualization, dimension reduction, clustering, marker gene selection, etc
  • Challenging without orthogonal measurement of technical variability (e.g. spikeins)
    • with spikeins: select genes with variance significantly above mean-variance trend in control genes
    • without spikeins: select genes with variance significantly above overall mean-variance trend in all endogenous genes (assumes variance of most genes is purely technical)

Highly variable genes with spikeins

Highly variable genes without spikeins

Review: Dimension reduction

  • Useful to summarize & visualize relationships between cells in low dimensional space
  • Commonly used approaches:
    • PCA (Principal Components Analysis)
    • tSNE (t-distributed Stochastic Neighbor Embedding)
    • UMAP (Uniform Manifold Approximation and Projection)
  • Clustering can be carried out on reduced dimensions, but with caution

Review: PCA vs Nonlinear methods

  • PCA attempts to extract the largest components of variation in the data
  • Nonlinear methods such as tSNE and UMAP attempt to map points to a global coordinate system that preserves local structure
    • density & distance of points not preserved
    • better at visualizing rare subtypes

image source: http://carbonandsilicon.net/rblogging/2018/02/27/UMAP_plots

General purpose clustering

  • Hierarchical: build and cut dendrogram based on pairwise distance matrix
  • K-means: iteratively assign cells to nearest cluster center
  • Density-based: clusters are defined as areas of higher density (e.g. DBSCAN)
  • Graph-based: assumes data can be represented as a graph structure; doesn’t require estimating pairwise distance matrix (e.g. SNN-Cliq)

drawingdrawingdrawing

Clustering for single-cell

  • Goal: automatically identify subpopulations of cell types/states
  • Many methods adapted/developed for single-cell to account for:
    • high dropout rate
    • batch effects
    • high dimensionality

Andrews & Hemburg 2018 (https://doi.org/10.1016/j.mam.2017.07.002)

Evaluation of clustering results

  • Many clustering algorithms require specification of the number of clusters (e.g. K in K-means)
  • Different algorithms may provide vastly different clusterings

Risso et al 2018 (https://doi.org/10.1371/journal.pcbi.1006378)

Metrics for evaluating clustering results

  • Single clustering:
    • Average silhouette: how similar a cell is to its own cluster (average distance within) compared to other clusters (average distance between); not efficient for large numbers of cells \[s(b) = \frac{\bar{d}_{\text{within}}(b) - \bar{d}_{\text{between}}(b)}{max(\bar{d}_{\text{within}}(b), \bar{d}_{\text{between}}(b))}\]
    • Modularity score: difference between number of within-cluster edges to the expected number under the null (random edges)
  • Comparing two clusterings
    • Adjusted rand index: similarity measure between two clusterings based on number of cells grouped accordingly and adjusted for random groupings

Benchmark of clustering methods for single-cell

Review: Differential expression

  • Traditional differential expression for bulk RNA-seq aims at detecting shifts in mean (fold change)
  • Read counts \(y_{bg}\) typically modeled with a two parameter distribution (e.g. mean \(\mu_{bg}\), dispersion \(\alpha_g\))
  • Recall DESeq2 model given size factors \(s_b\): \[ y_{bg} \sim NB(\mu_{bg}, \alpha_g)\] \[\mu_{bg}=s_b q_{bg}\] \[ log_2(q_{bg}) = x_{b.}\beta_g \]

Bulk RNA-seq measures averages

Heterogeneity hidden in bulk RNA-seq

Distributions of single-cell read counts

Differential distributions

scDD framework

  1. Test for global difference in distribution of expressed values with nonparametric two-sample Kolmogorov-Smirnov test: \[ D_{n,m} = sup_x |F_{1,n}(x) - F_{2,m}(x)|\] \[ \text{Reject at level } \alpha \text{ if } D_{n,m} > \sqrt{\frac{-(n+m)ln(\alpha)}{2mn}} \]
  2. Test for difference in dropout: GLM adjusted for cellular detection rate
  3. Classify significant genes into summary patterns:
  • Fit \(K-\) component normal mixture models to normalized log transformed expression values within each condition
  • Select \(K\) by BIC
  • Select pattern based on number of components & their equality of means

scDD detects more subtle and complex changes

DE after clustering

  • A common workflow will include detection of DE genes between discovered subtypes (clusters)
  • Caution must be used in the interpretation of p-values here
  • Since clusters are formed by maximizing differences in expression, it is not surprising to discover DE genes between clusters - we can even detect changes in null data
  • We may only be interested in differences in average expression here

Alternative approach for DE in single cell

  • MAST: Finak et al. 2015 (http://doi.org/10.1186/s13059-015-0844-5)
  • Hurdle model that depends on whether gene \(g\) in cell \(b\) is expressed (Normalized log count \(Y_{bg} > 0\)) or not (\(Y_{bg} = 0\)): \[ logit(P(Y_{bg} = 0)) = X_b\beta^D_g\] \[ P(Y_{bg} | Y_{bg} > 0) \sim N(X_b\beta_g^C, \sigma_g^2) \]
  • Obtain \(\chi_{n}^{2}\) and \(\chi_{m}^{2}\) likelihood ratio test statistics for \(\beta_g^D\) and \(\beta_g^C\), with \(n\) and \(m\) degrees of freedom, respectively
  • Global test statistic for a change in either component is the sum of the two:
    • sum of \(\chi^2\) random variables yields another \(\chi^2\) with degrees of freedom equal to the sum of degrees of freedom of the two original variables: \[\chi_m^{2} + \chi_n^{2} \sim \chi_{n+m}^{2}\]

Many other approaches for DE analysis

Scaling up to millions of cells with Bioconductor

  • Some datasets may be so large that they do not fit into memory efficiently (or at all)
  • Handy infrastructure for computing on data that is saved on disk - allows computation on subsets:
    • HDF5 - file format interfaced with read/write functionalities in Bioconductor packages (rhdf5, Rhdf5lib, beachmat)
    • DelayedArray, DelayedMatrixStats - Bioconductor packages that provide an ecosystem for storing and computing on disk-backed files just like you would ordinary matrices
  • Configured for use in SingleCellExperiment and interfaces automatically with scater and scran
    • requires saving data as HDF5 format

HDF5 in action - 1.3 million cells

## <27998 x 1306127> HDF5Matrix object of type "integer":
##                [,1]       [,2]       [,3]       [,4] ... [,1306124]
##     [1,]          0          0          0          0   .          0
##     [2,]          0          0          0          0   .          0
##     [3,]          0          0          0          0   .          0
##     [4,]          0          0          0          0   .          0
##     [5,]          0          0          0          0   .          0
##      ...          .          .          .          .   .          .
## [27994,]          0          0          0          0   .          0
## [27995,]          1          0          0          2   .          0
## [27996,]          0          0          0          0   .          0
## [27997,]          0          0          0          0   .          0
## [27998,]          0          0          0          0   .          0
##          [,1306125] [,1306126] [,1306127]
##     [1,]          0          0          0
##     [2,]          0          0          0
##     [3,]          0          0          0
##     [4,]          0          0          0
##     [5,]          0          0          0
##      ...          .          .          .
## [27994,]          0          0          0
## [27995,]          1          0          0
## [27996,]          0          0          0
## [27997,]          0          0          0
## [27998,]          0          0          0

Resources for handling large data

R tools for downstream analysis

  • scran: detection of highly variable genes, dimension reduction, clustering (Bioconductor)
  • Seurat: detection of highly variable genes, dimension reduction, clustering, DE (CRAN)
  • mbkmeans: minibatch K-means for large data (Bioconductor)
  • scDD: differential distributions (Bioconductor)
  • MAST: differential expression (Bioconductor)

drawing

There are many more tools I didn’t mention…

Growing number of computational tools

Curated list of tools from Sean